Coronavirus disease(COVID19) is an infectious disease caused by a newly discovered coronavirus. It has spread to numerous countries across all continent since the first discovery in Wuhan, China back in Nov 2019 and was declared as pandemic by WHO on March 11.
Various countries has came out measure/restrictions to respond to COVID-19. Since “circuit breaker”, a partial nationwide lockdown, where only essential services could allow opened, SG residents have started to feel a great impact on daily life where they are encouraged to stay home as much as possible and wearing of mask became mandatory when going out. SG government has constantly revising policy and social restrictions. Three phases of planned reopening were announced on 19 May namely “Safe Reopening” (Phase1) “Safer Transition” (Phase2), and finally “Safe Nation” (Phase3).
Microblogging has become one of the most useful tools for sharing everyday life events and news and for expressing opinions about those events. As Twitter posts are short and constantly being generated, they are a great source for providing public sentiment towards events that occurred throughout the COVID-19 period in Singapore.
In our Capstone Project, we perform exploratory data analysis about SG COVID situation and sentiment analysis and modeling on the tweets about COVID19 to seek to answer the following research questions:
What are the main prevalent sentiment and emotions expressed in words in Singapore tweets about current COVID situation?
Is there any change of sentiment over a period of time amidst global reopening with higher vaccination rate, in contrast growing new daily cases/death locally?
For our data science project, we activated the following packages, using the Tidyverse approach.
# Load necessary packages
pacman::p_load(tidyverse, broom, modelr, lubridate,
tidytext, wordcloud2, wordcloud, reshape2,
textdata, # Employing Lexicon
jtools, huxtable, gridExtra,
psych, Hmisc, car, sandwich,
ggthemes, scales, ggstance,
gvlma, gtrendsR, rtweet, glue)
my_colors <- c("#05A4C0", "#85CEDA", "#D2A7D8", "#A67BC5", "#BB1C8B", "#8D266E", "gold4", "darkred", "deepskyblue4")
my_theme <- theme(plot.background = element_rect(fill = "grey98", color = "grey20"),
panel.background = element_rect(fill = "grey98"),
panel.grid.major = element_line(colour = "grey87"),
text = element_text(color = "grey20"),
plot.title = element_text(size = 22),
plot.subtitle = element_text(size = 17),
axis.title = element_text(size = 15),
axis.text = element_text(size = 15),
legend.box.background = element_rect(color = "grey20", fill = "grey98", size = 0.1),
legend.box.margin = margin(t = 3, r = 3, b = 3, l = 3),
legend.title = element_blank(),
legend.text = element_text(size = 15),
strip.text = element_text(size=17))Then, we imported our dataset.
Data Source 1: SG COVID DATA
SSA <- read.csv("Covid-19.csv")Within the dataset, there are five questions about satisfaction with life (SWL), which will serve as our dependent (continuous y) variable.
Also, the dataset contains social comparisons orientation (11 items). They were asked by the following question.
Most people compare themselves from time to time with others. For example, they may compare the way they feel, their opinions, their abilities, and/or their situation with those of other people. Here is nothing particularly ‘good’ or ‘bad’ about this type of comparison, and some people do it more than others.
We would like to find out how often you compare yourself with other people. To do that we would like to ask you to indicate how much you agree with each statement below.
Out of the 11 questions, six were about social comparison regarding abilities (SCA, which will serve as our dependent variable (continuous x1); whereas the other five were about social comparisons regarding opinions (SCB).
Our moderator (categorical x2), gender, is contained as follows: 1 being male; 2 being female.
glimpse(SSA)## Rows: 999
## Columns: 37
## $ Date <chr> "2020-01…
## $ Daily.Confirmed <int> 1, 2, 1,…
## $ False.Positives.Found <int> NA, NA, …
## $ Cumulative.Confirmed <int> 1, 3, 4,…
## $ Daily.Discharged <int> 0, 0, 0,…
## $ Passed.but.not.due.to.COVID <int> 0, 0, 0,…
## $ Cumulative.Discharged <int> 0, 0, 0,…
## $ Discharged.to.Isolation <int> 0, 0, 0,…
## $ Still.Hospitalised <int> 1, 3, 4,…
## $ Daily.Deaths <int> 0, 0, 0,…
## $ Cumulative.Deaths <int> 0, 0, 0,…
## $ Tested.positive.demise <int> 0, 0, 0,…
## $ Daily.Imported <int> 1, 2, 1,…
## $ Daily.Local.transmission <int> 0, 0, 0,…
## $ Local.cases.residing.in.dorms.MOH.report <int> NA, NA, …
## $ Local.cases.not.residing.in.doms.MOH.report <int> NA, NA, …
## $ Intensive.Care.Unit..ICU. <int> 0, 0, 0,…
## $ General.Wards.MOH.report <int> NA, NA, …
## $ In.Isolation.MOH.report <int> NA, NA, …
## $ Total.Completed.Isolation.MOH.report <int> NA, NA, …
## $ Total.Hospital.Discharged.MOH.report <int> NA, NA, …
## $ Requires.Oxygen.Supplementation.or.Unstable <int> NA, NA, …
## $ Linked.community.cases <int> NA, NA, …
## $ Unlinked.community.cases <int> NA, NA, …
## $ Phase <chr> "", "", …
## $ Cumulative.Vaccine.Doses <int> NA, NA, …
## $ Cumulative.Individuals.Vaccinated <int> NA, NA, …
## $ Cumulative.Individuals.Vaccination.Completed <int> NA, NA, …
## $ Perc.population.completed.at.least.one.dose <chr> "", "", …
## $ Perc.population.completed.vaccination <chr> "", "", …
## $ Sinovac.vaccine.doses <int> NA, NA, …
## $ Cumulative.individuals.using.Sinovac.vaccine <int> NA, NA, …
## $ Doses.of.other.vaccines.recognised.by.WHO <int> NA, NA, …
## $ Cumulative.individuals.using.other.vaccines.recognised.by.WHO <int> NA, NA, …
## $ Number.taken.booster.shots <int> NA, NA, …
## $ Perc.population.taken.booster.shots <chr> "", "", …
## $ X <lgl> NA, NA, …
Data Source 1: Tidy & Transform
The first thing we did with our loaded dataset was to created three variables (SWL, SCA, SCB) that contain the mean value of the items for each variable. Also, we recoded the gender variable.
SSA1<- tibble(SSA)
class(SSA)## [1] "data.frame"
# View(SSA1)
SSA1 <- SSA1[-(1:626) , -(18:37)]
SSA1 <- SSA1[ , -(11:16)]
SSA1 <- SSA1[ , -(3:8)]
SSA1 <- SSA1[-(35:373) , ]
SSA1$Date <- as.Date(SSA1$Date)
SSA1 %>%
ggplot(aes(x=Date)) +
geom_line(aes(y = Daily.Confirmed), color = "darkred") +
geom_line(aes(y = Daily.Deaths), color="yellow") +
geom_line(aes(y = Still.Hospitalised), color="blue")+
geom_line(aes(y = Intensive.Care.Unit..ICU.), color="green")glimpse(SSA1)## Rows: 34
## Columns: 5
## $ Date <date> 2021-10-10, 2021-10-11, 2021-10-12, 2021-10…
## $ Daily.Confirmed <int> 2809, 2263, 2976, 3190, 2932, 3445, 3348, 30…
## $ Still.Hospitalised <int> 1583, 1668, 1589, 1477, 1481, 1563, 1434, 16…
## $ Daily.Deaths <int> 9, 10, 11, 9, 15, 8, 9, 9, 6, 7, 18, 16, 14,…
## $ Intensive.Care.Unit..ICU. <int> 41, 42, 42, 46, 46, 48, 62, 66, 67, 71, 67, …
SSA1 %>%
select("Daily.Confirmed", "Daily.Deaths","Still.Hospitalised", "Intensive.Care.Unit..ICU."
) %>%
summary(.)## Daily.Confirmed Daily.Deaths Still.Hospitalised Intensive.Care.Unit..ICU.
## Min. :1767 Min. : 6.00 Min. :1434 Min. :41.00
## 1st Qu.:2943 1st Qu.: 9.00 1st Qu.:1584 1st Qu.:58.25
## Median :3182 Median :12.00 Median :1640 Median :64.00
## Mean :3206 Mean :12.03 Mean :1633 Mean :61.88
## 3rd Qu.:3472 3rd Qu.:14.75 3rd Qu.:1686 3rd Qu.:68.50
## Max. :5324 Max. :18.00 Max. :1757 Max. :75.00
# SSA1 %>%
# select("Daily.Confirmed", "Daily.Deaths","Still.Hospitalised", "Intensive.Care.Unit..ICU."
# ) %>%
# alpha(.)Data Source 2: SG TWEETER DATA
# We observed 7-days data usually capped below 3000 tweets per extraction.
# sg_tweets <- search_tweets(q = "#coronavirus OR #covid19 OR #COVID OR #stayhome OR #Covid-19 OR #pandemic OR #virus OR #social-distance OR #self-quarantine OR $swab-test OR #PCR OR #infection-rate",
# n = 3000,
# lang = "en",
# include_rts = F,
# geocode = lookup_coords("singapore")
sg_tweets <- read.csv("covid19_sg_tweeter_1010_1115.csv")Let’s explore our tweets data with regards to COVID-19 since our first extraction on 18th October to understand sentiment after recent sharp rise in number of local cases and death since end-September.
We also identified 2 key events over the period to analyse further to answer our research question if the event will have significance influence on the public sentiment.
2021-10-20
2021-11-08
# Basic EDA on amount of tweet in time (ALL)
options(repr.plot.width=20, repr.plot.height=9)
sg_tweets %>%
select(created_at) %>%
mutate(date = ymd(as.Date(created_at))) %>%
group_by(date) %>%
summarise(n = n(), .groups = "drop_last") %>%
ggplot(aes(x=date, y = n)) +
geom_line(size = 1, color = my_colors[1]) +
coord_cartesian(clip = 'off') +
geom_vline(xintercept = as.Date('2021-10-20'), linetype="dotted", size = 1.5, color = "red") +
geom_vline(xintercept = as.Date('2021-11-08'), linetype="dotted", size = 1.5, color = "darkblue") +
my_theme + theme(axis.title.x = element_blank()) +
scale_x_date(date_breaks = "1 day") +
ggthemes::theme_fivethirtyeight() +
theme(axis.text.x = element_text(angle = 45, size = rel(0.6), vjust = 1, hjust = 1 )) +
labs(title = "Number of COVID-19 Tweets shared between 10th Oct - 15th Nov", subtitle = "Number of tweets spiked on key events") +
geom_label(aes(x=as.Date('2021-10-19'), y = 380, label = "PM Lee's address on COVID-19"), color = "red", size = 4, angle = 90, fontface = "bold") +
geom_label(aes(x=as.Date('2021-11-07'), y = 380, label = "More Opening on COVID-19 restrictions" ), color = "darkblue", size = 4, angle = 90, fontface = "bold") Data Source 2: Tidy & Transform
# Step 1: Tokenization ----
sg_tweets_id <- sg_tweets %>%
mutate(created_at = as.Date(sg_tweets$created_at)) %>%
rowid_to_column("id")
tidy_tweets_token <- sg_tweets_id %>%
drop_na(text) %>%
select(id, created_at, status_id, text) %>%
filter(text != "") %>%
unnest_tokens(word, text, token = "tweets")
# Step 2: Cleaning ----
tweets_cleaned <- tidy_tweets_token %>%
anti_join(tidytext::stop_words)
# Manual cleaning, filtering out unwanted words
tweets_token_cleaned <- tweets_cleaned %>%
filter(!word %in% c("singapore", "covid", "covid19","positive","negative","oct","nov","news","amp","reuters","news","daily","malaysia","november","october","october","press","journal","amid","weekly","days","weeks","china","chinese","report","am","pm","dont","taking","found","morning","bloomberg","months","month","india","media","week","read","reports","data","europe","monday","tuesday","wednesday","thursday","friday","satursday","sunday","wall","street") & !str_detect(word,"^#|^@") & !str_detect(word, "^[:digit:]+$"))covid_tweets_for_wc <- tweets_token_cleaned %>%
group_by(word) %>%
summarise(frequency = n()) %>%
arrange(desc(frequency))
covid_tweets_for_wc %>%
filter(frequency > 3) %>%
wordcloud2(backgroundColor = "black",
color = "random-light")# A Postive-Negative Word Cloud by using BING
BING <- get_sentiments("bing")
tweets_token_cleaned %>%
inner_join(BING, by="word") %>%
count(word, sentiment, sort=T) %>%
acast(word ~ sentiment, value.var = "n", fill=0) %>%
comparison.cloud(colors=my_colors[c(5, 1)], max.words = 400, title.size = 2,
scale = c(3,.5))AFINN <- get_sentiments("afinn")
## TOP 3 MOST NEGATIVE TWEET ----
tweets_AFINN_indexed <- tweets_token_cleaned %>%
inner_join(AFINN, by = "word")
tweet_level_sentiment <- tweets_AFINN_indexed %>%
group_by(id) %>%
summarise(average_sentiment = mean(value),
n_of_words_indexed = n()
)
top3_negative <- tweet_level_sentiment %>%
arrange(average_sentiment) %>%
head(3)
sg_tweets_id %>%
filter(id %in% top3_negative$id ) %>%
select(text) %>%
pull()## [1] "'They don't choose to have it': Why Covid-19 is hell for people with OCD"
## [2] "Asked my mum to get me donuts 🍩 I’ve no idea why i did that when I can’t taste no shit. Arghh but just cravings 😫 fuck covid 19, fuck the monthly cycle, fuck everything! 😖"
## [3] "deadass my eyes went from covid 19 then positive bitch"
# TOP 3 MOST POSITIVE TWEETS ----
top3_positive <- tweet_level_sentiment %>%
arrange(desc(average_sentiment)) %>%
head(3)
sg_tweets_id %>%
filter(id %in% top3_positive$id) %>%
select(text) %>%
pull()## [1] "…or else they find alternative employment. They are also bringing in a vax passport to enter some venues buildings etc, unless there is a current Covid-19 test (often valid only for a day or 2). It's basically becoming a no vax, no job, & no fun situation. This I approve."
## [2] "“MOH said the doctor had no known medical conditions and was partially vaccinated with a non-mRNA Covid-19 vaccine under the Special Access Route.”\n\nWow, this has to be stated so explicitly. I wonder why? \n\n @Huigoon"
## [3] "@DavidBieleski @bergeron_brent @katiehasedits This trend was evident on 9Aug 2020, Back then out of the 3, SG had the most cases with 55,104, whilst NZ, the least with 1,219. Even back then Greece the highest with 213 COVID-19 related deaths.\n\nFast forward to 5 Nov 2021, Greece is the winner in Covid-19 cases & deaths."
Distribution Breakdown by Emotion Class using NRC technique
NRC <- get_sentiments("nrc")
options(repr.plot.width=15, repr.plot.height=9)
tweets_token_cleaned %>%
inner_join(NRC, "word") %>%
filter(!sentiment %in% c("positive", "negative")) %>%
count(sentiment, sort=T) %>%
ggplot(aes(x=reorder(sentiment, n), y=n)) +
geom_bar(stat="identity", aes(fill=n), show.legend=F) +
geom_label(aes(label=format(n, big.mark = ",")), size=5, fill="white") +
labs(x="Sentiment", y="Frequency", title="What is the overall mood in Tweets?") +
scale_fill_gradient(low = my_colors[3], high = my_colors[1], guide="none") +
coord_flip() +
my_theme + theme(axis.text.x = element_blank())Most Fequent Words by Emotion Class
#options(repr.plot.width=25, repr.plot.height=9)
tweets_token_cleaned %>%
inner_join(NRC, "word") %>%
count(sentiment, word, sort=T) %>%
group_by(sentiment) %>%
arrange(desc(n)) %>%
slice(1:7) %>%
ggplot(aes(x=reorder(word, n), y=n)) +
geom_col(aes(fill=sentiment), show.legend = F) +
facet_wrap(~sentiment, scales = "free_y", nrow = 2, ncol = 5) +
coord_flip() +
my_theme + theme(axis.text.x = element_blank()) +
labs(x="Word", y="Frequency", title="Sentiment split by most frequent words") +
scale_fill_manual(values = c(my_colors, "#BE82AF", "#9D4387", "#DEC0D7",
"#40BDC8", "#80D3DB", "#BFE9ED"))20 Oct 2021
Here, we are interested in a research question: Did COVID-19 key events within our sentiment analysis period on 20 Oct occured below changed the public sentiment or gain more trust in effects with the leadership?
We are going to use Regression Discontinuity Analysis on the causal inference and effect.
Firstly, we explore the data with 10 days before and after PM Lee’s address, assuming date close to the cut off on 20 Oct has more relevant effects.
Using AFINN technique
sentiment_daily <- tweets_AFINN_indexed %>%
group_by(created_at) %>%
summarise(average_sentiment = mean(value),
n_of_words_indexed = n())
# Plot
sentiment_daily %>%
filter(created_at >= as.Date('2021-10-10') & created_at <= as.Date('2021-10-30')) %>%
ggplot(aes(x = created_at, y = average_sentiment) ) +
geom_point(size = 2, color = my_colors[1]) +
geom_vline(xintercept = as.Date('2021-10-20'), size = 1, linetype="dotdash", color = my_colors[6]) +
scale_x_date(date_breaks = "1 day") +
ggtitle("Distribution of Average Sentiment \n10 days before & after PM address") +
ggthemes::theme_fivethirtyeight() +
theme(axis.text.x = element_text(angle = 45, size = rel(0.6), vjust = 1, hjust = 1 ))Using NRC technique for Emotion Classification
# Extract for analysis period
tweets_token_analysis_period <- tweets_token_cleaned %>%
filter(created_at >= as.Date('2021-10-10') & created_at <= as.Date('2021-10-30'))
classified_sentiment <- tweets_token_analysis_period %>%
inner_join(NRC, "word") %>%
group_by(sentiment, created_at) %>%
summarise(cnt = n())
# Plot Chart
classified_sentiment %>%
filter(!sentiment %in% c("positive", "negative")) %>%
ggplot(aes(x=created_at, y =cnt, color = sentiment)) +
geom_point() +
facet_wrap(~sentiment, scales = "free_y", nrow = 2, ncol = 4) +
geom_vline(xintercept = as.Date('2021-10-20'), size = 1,linetype="dotdash", color = my_colors[8]) +
scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d") +
theme(axis.text.x = element_text(angle = 45, size = rel(0.8), vjust = 1, hjust = 1 )) +
labs(y="Count of Emotional Words", x="Period of Date")Using Radar Chart, another visualisation chart.
# Data transformation
char_sentiment <- classified_sentiment %>%
filter(!sentiment %in% c("positive", "negative")) %>%
mutate (covid_address_effect = as.factor(ifelse(created_at < '2021-10-20','Before','After'))) %>%
group_by(sentiment, covid_address_effect) %>%
summarise(char_sentiment_count = sum(cnt))
total_char <- classified_sentiment %>%
filter(!sentiment %in% c("positive", "negative")) %>%
mutate (covid_address_effect = as.factor(ifelse(created_at < '2021-10-20','Before','After'))) %>%
group_by(covid_address_effect) %>%
summarise(total = sum(cnt))
# Plot Chart
char_sentiment %>%
inner_join(total_char, by="covid_address_effect") %>%
mutate(percent = char_sentiment_count / total * 100 ) %>%
select(-char_sentiment_count, -total) %>%
spread(covid_address_effect, percent) %>%
radarchart::chartJSRadar(showToolTipLabel = T, main="Any Effects on the Emotion Class Percentage After Address? - No", maxScale=25, responsive=T,addDots = T, colMatrix = grDevices::col2rgb(my_colors[c(2,4)]),lineAlpha = 0.7, polyAlpha =0.2)OLS Linear Regression on average sentiment over time period
merged_dataset <- SSA1 %>%
inner_join(sentiment_daily, by = c("Date" = "created_at")) %>%
filter(Date >= as.Date('2021-10-10') & Date <= as.Date('2021-10-30'))
# add dummy variable for pre-effect = 0, and post-effect = 1
merged_dataset <- merged_dataset %>%
mutate (covid_address_effect = as.factor(ifelse(Date < '2021-10-20','Before','After')))
merged_dataset %>%
lm(average_sentiment ~ covid_address_effect + I(Date - as.Date('2021-10-20')),.) %>%
summary()##
## Call:
## lm(formula = average_sentiment ~ covid_address_effect + I(Date -
## as.Date("2021-10-20")), data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36397 -0.09578 0.01303 0.08805 0.32045
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.340705 0.080718 -4.221 0.000514 ***
## covid_address_effectBefore -0.238246 0.150119 -1.587 0.129913
## I(Date - as.Date("2021-10-20")) -0.007743 0.012382 -0.625 0.539602
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1718 on 18 degrees of freedom
## Multiple R-squared: 0.2093, Adjusted R-squared: 0.1214
## F-statistic: 2.382 on 2 and 18 DF, p-value: 0.1209
merged_dataset %>%
ggplot(aes(x = Date, y = average_sentiment)) +
geom_point(aes(color = covid_address_effect)) +
geom_smooth(method = "lm") +
scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d", date_minor_breaks = "1 day") +
ggthemes::theme_fivethirtyeight() +
ylab("Average Sentiment") +
theme(axis.title.y = element_text(), legend.position = "bottom") +
labs(title="OLS Simple Regression Model") ### Regression Discontinuity Analysis
We perform Regression Discontinuity Analysis on the effects of PM address event.
We expect to observe there is a high volume on Oct. 20, jump in sentiment score/count on Oct. 19 and Oct. 21
# RDD
RDD <- merged_dataset %>%
ggplot(aes(x = Date, y = average_sentiment, color = covid_address_effect)) +
geom_point() +
geom_smooth(method = "lm") +
ggthemes::theme_fivethirtyeight() +
ggtitle("Regression Discontinuity Analysis") +
ylab("Average Sentiment") +
theme(axis.title.y = element_text()) +
scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d", date_minor_breaks = "1 day") +
geom_vline(xintercept = as.Date('2021-10-20'), size = 1,linetype="dotdash", color = my_colors[8])
# Difference in Means Test
RDD_box <- merged_dataset %>%
ggplot(aes(x = Date, y = average_sentiment, color = covid_address_effect)) +
geom_boxplot(outlier.colour="black",
outlier.size=2, notch=FALSE) +
geom_point() +
ggthemes::theme_fivethirtyeight() +
ggtitle("Test for Significant Difference") +
scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d", date_minor_breaks = "1 day") +
geom_vline(xintercept = as.Date('2021-10-20'), size = 1,linetype="dotdash", color = my_colors[8])
gridExtra::grid.arrange(RDD, RDD_box, ncol=2)Perform T-test to find significance in difference between 2 groups (Before and After PM address)
# Conduct a difference of means test
# Hypothesis: H0 : mean of pre-address_effect = mean of post-address_effect
merged_dataset %>%
t.test(average_sentiment ~ covid_address_effect, .)##
## Welch Two Sample t-test
##
## data: average_sentiment by covid_address_effect
## t = 2.1578, df = 18.219, p-value = 0.04453
## alternative hypothesis: true difference in means between group After and group Before is not equal to 0
## 95 percent confidence interval:
## 0.004272338 0.309624417
## sample estimates:
## mean in group After mean in group Before
## -0.3794175 -0.5363659
For the preparation of the model, we created and ran a correlational matrix, to see how our variables of interest (within the model) are related.
pacman::p_load(Hmisc, broom, DT)
SSA1 %>%
select(Daily.Confirmed, Daily.Deaths, Still.Hospitalised,Intensive.Care.Unit..ICU.) %>%
as.matrix(.) %>%
rcorr(.) %>%
tidy(.) %>%
rename(variable_1 = column1,
variable_2 = column2,
corr = estimate) %>%
mutate(abs_corr = abs(corr)
) %>%
datatable(options = list(scrollX = T),
) %>%
formatRound(columns = c("corr", "p.value", "abs_corr"),
digits = 3)We performed mean-centering transformations on all the variables that will be turned into interaction terms.
data <- SSA1 %>%
select(Daily.Confirmed, Daily.Deaths, Still.Hospitalised,Intensive.Care.Unit..ICU.) %>%
mutate_at(vars(Intensive.Care.Unit..ICU.:Still.Hospitalised),
~(. - mean(.,na.rm=T)))We ran two regression models. The first regressed demographics and social comparisons orientations onto life satisfaction (model1).
\[ \begin{eqnarray} \widehat{swl} = intercept + b_1fem + b_2ageyear + b_3educ + b_4income + b_5sca + b_6scb + \epsilon \end{eqnarray} \]
Our key investigation lies in the next model, in which we regressed demographics and social comparisons orientations, along with interaction terms, onto life satisfaction (model2).
\[ \begin{eqnarray} \widehat{swl} = intercept + b_1fem + b_2ageyear + b_3educ + b_4income + b_5sca + b_6scb + \\ + b_7fem \times sca + b_8ageyear \times sca + b_9educ \times sca + b_10educ \times sca + \epsilon \end{eqnarray} \]
model1 <- lm(Daily.Deaths ~ Daily.Confirmed + Still.Hospitalised + Intensive.Care.Unit..ICU. ,
data)
tidy(model1) %>% as_tibble()| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 10.8 | 3.19 | 3.38 | 0.00202 |
| Daily.Confirmed | 0.000382 | 0.000978 | 0.391 | 0.699 |
| Still.Hospitalised | -0.000554 | 0.00909 | -0.061 | 0.952 |
| Intensive.Care.Unit..ICU. | 0.0588 | 0.0701 | 0.839 | 0.408 |
glance(model1)| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0332 | -0.0635 | 3.55 | 0.343 | 0.794 | 3 | -89.2 | 188 | 196 | 378 | 30 | 34 |
model2 <- lm(Daily.Deaths ~ (Daily.Confirmed + Still.Hospitalised) * Intensive.Care.Unit..ICU. ,
data)
tidy(model2) %>% as_tibble()| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 10.9 | 3.5 | 3.11 | 0.00429 |
| Daily.Confirmed | 0.000382 | 0.0011 | 0.346 | 0.732 |
| Still.Hospitalised | -0.00169 | 0.0103 | -0.165 | 0.87 |
| Intensive.Care.Unit..ICU. | -0.0179 | 0.46 | -0.0389 | 0.969 |
| Daily.Confirmed:Intensive.Care.Unit..ICU. | 2.39e-05 | 0.000157 | 0.152 | 0.88 |
| Still.Hospitalised:Intensive.Care.Unit..ICU. | -0.000269 | 0.00115 | -0.234 | 0.817 |
glance(model2)| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0361 | -0.136 | 3.67 | 0.21 | 0.956 | 5 | -89.1 | 192 | 203 | 377 | 28 | 34 |
We tested if model2, with interaction terms, enhances the explanatory power of the model using anova function.
anova(model1, model2)| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 30 | 378 | ||||
| 28 | 377 | 2 | 1.15 | 0.0427 | 0.958 |
The results of the analysis suggest that adding the interaction terms significantly increases the R-squared of model2, as compared to model1.
Prof. Roh’s Note: “Here, please check the linearity assumption, using Global Validation of Linear Model Assumption (
gvlma) package. You may visualize the core infomation of assumption checks, usingggfortifypackage.”
library(gvlma)
gvlma(model2)##
## Call:
## lm(formula = Daily.Deaths ~ (Daily.Confirmed + Still.Hospitalised) *
## Intensive.Care.Unit..ICU., data = data)
##
## Coefficients:
## (Intercept)
## 1.087e+01
## Daily.Confirmed
## 3.821e-04
## Still.Hospitalised
## -1.692e-03
## Intensive.Care.Unit..ICU.
## -1.792e-02
## Daily.Confirmed:Intensive.Care.Unit..ICU.
## 2.393e-05
## Still.Hospitalised:Intensive.Care.Unit..ICU.
## -2.693e-04
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = model2)
##
## Value p-value Decision
## Global Stat 2.4730 0.6495 Assumptions acceptable.
## Skewness 0.1038 0.7473 Assumptions acceptable.
## Kurtosis 1.7981 0.1799 Assumptions acceptable.
## Link Function 0.4116 0.5212 Assumptions acceptable.
## Heteroscedasticity 0.1595 0.6896 Assumptions acceptable.
library(ggthemes)
theme_set(theme_fivethirtyeight())
library(ggfortify)
autoplot(gvlma(model2))library(car)
vif(model1);vif(model2)## Daily.Confirmed Still.Hospitalised Intensive.Care.Unit..ICU.
## 1.025765 1.226518 1.211086
## Daily.Confirmed
## 1.222041
## Still.Hospitalised
## 1.463277
## Intensive.Care.Unit..ICU.
## 48.865490
## Daily.Confirmed:Intensive.Care.Unit..ICU.
## 52.250322
## Still.Hospitalised:Intensive.Care.Unit..ICU.
## 1.460227
kable in R MarkdownProf. Roh’s Note: “Now that the assumption check is done, you might want to put the results into a prettier format of table. The default print-out of table in
R Markdowndoes not look good. Theknitrpackage contains a very basic command,kable, which will format an array or data frame more presentable for display. Thus, use the following for your report.”
library(knitr) # Please install the package "knitr" first.
library(kableExtra) # You might want to use package "kableExtra" as well.##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:huxtable':
##
## add_footnote
## The following object is masked from 'package:dplyr':
##
## group_rows
kable(tidy(model2))%>%
kable_styling("striped", full_width = T, fixed_thead = T) %>%
column_spec(c(1, 5), bold = T) %>%
row_spec(c(2, 4, 6), bold = T, color = "white", background = "#ff6347")| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 10.8686811 | 3.4965288 | 3.1084203 | 0.0042883 |
| Daily.Confirmed | 0.0003821 | 0.0011032 | 0.3463539 | 0.7316660 |
| Still.Hospitalised | -0.0016917 | 0.0102596 | -0.1648897 | 0.8702155 |
| Intensive.Care.Unit..ICU. | -0.0179243 | 0.4601972 | -0.0389492 | 0.9692072 |
| Daily.Confirmed:Intensive.Care.Unit..ICU. | 0.0000239 | 0.0001573 | 0.1521081 | 0.8801924 |
| Still.Hospitalised:Intensive.Care.Unit..ICU. | -0.0002693 | 0.0011515 | -0.2338565 | 0.8167978 |
kable(glance(model2))%>%
kable_styling("striped", full_width = T, font_size = 12) %>%
column_spec(c(2, 4, 6), bold = T, color = "white", background = "#ff6347")| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0360927 | -0.1360335 | 3.668687 | 0.2096876 | 0.9555829 | 5 | -89.13761 | 192.2752 | 202.9597 | 376.8594 | 28 | 34 |
The regression analysis came up with two significant interaction terms.
First, it appears that the relationships between social comparisons orientation regarding abilities and life satisfaction is different depending on one’s gender.
Second, it appears that the relationships between social comparisons orientation regarding abilities and life satisfaction is different depending on one’s education.
To see the patterns of interaction, we visualized the significant interaction effects on the next chapter.
Unlike gender, which is a categorical variable, education is a continuous variable in our model. To make it easier for interpretation, we categorized them into three levels (mean below 1SD, mean, and mean above 1SD). We set gender at 0.5 instead of 0, as the ratio of male and female is equal.
# data2 %>% summarise(sd(educ))
#
# grid_group3 <- data2 %>%
# data_grid(sca, educ = c(-1.26, 0.00, 1.26),
# fem = 0.5, ageyear = 0, income = 0, scb = 0) %>%
# add_predictions(model2)We undid the centering of variable (sca).
# grid <- grid_group3 %>%
# mutate(sca = sca + mean(data$sca), educ = factor(as.double(factor(educ))))The following figure represents the three lines that represent differing education levels set at M-1SD, Mean, M+1SD, as noted above, and how differing education levels make differences to relationships between social comparison orientation and life satisfaction.
# ggplot(grid, aes(x = sca, y = pred, color = factor(educ))) +
# geom_line(size = 2) +
# scale_color_discrete(breaks = c(1, 2, 3),
# label=c("Low in Education",
# "Mean Education",
# "High in Education")) +
# labs(x = "Social Comparison Orientation (Abilities)",
# y = "Life Satisfaction",
# color = "Education") +
# coord_cartesian(ylim = c(2.0, 3.5), xlim = c(0.7, 5.3)) +
# theme_linedraw() +
# theme(legend.position= "top")Prof. Roh’s Note: “Yes, you might want to perform the analysis above, using
jtools,huxtable,ggstance, andinteractionspackages, as you have learned in class.”
# pacman::p_load(jtools, huxtable, ggstance, interactions)
#
# m1 <- lm(swl ~ fem + ageyear + educ + income + sca + scb,
# data = data)
#
# m2 <- lm(swl ~ (fem + ageyear + educ + income) * sca + scb,
# data = data)
#
# export_summs(m1, m2,
# error_format = "(t = {statistic}, p = {p.value})",
# align = "right",
# model.names = c("Main Effects Only", "with Interactions"),
# digits = 3)
#
# plot_summs(m1, m2,
# scale = T,
# plot.distributions = T,
# model.names = c("Main Effects Only", "with Interactions")) +
# theme(legend.position = "top")
#
# sim_slopes(m2,
# pred = sca,
# modx = fem,
# johnson_neyman = F)
#
# sim_slopes(m2,
# pred = fem,
# modx = sca,
# johnson_neyman = T)
#
# set.seed(20210427)
#
# interact_plot(m2,
# pred = "sca",
# modx = "fem",
# modx.labels = c("Male",
# "Female"),
# interval = T,
# int.width = 0.95,
# colors = c("tomato3",
# "deepskyblue4"),
# vary.lty = T,
# line.thickness = 1,
# legend.main = "Gender",
# plot.points = T,
# jitter = 0.1) +
# geom_vline(xintercept = 3.19, col = "red", linetype = 1, size = 1) +
# annotate("text",
# x = 2.1,
# y = 4.7,
# label = paste0("The shaded areas denote the boundary\nbetween regions ",
# "of significance and\nnon-significance based on alpha at 5%")
# ) +
# annotate("rect",
# fill = "yellow",
# alpha = 0.1,
# xmin = 3.19,
# xmax = 5,
# ymin = 1,
# ymax = 5) +
# labs(title = "The Interplay of Gender and Social Comparison\non Life Satisfaction",
# subtitle = paste0("Among the female, the more they compare their abilities ",
# "with others,\nthere seems to be lesser life satisfaction."),
# caption = "Source: Your Dataset",
# x = "Social Comparison of Abilities (1-5)",
# y = "Life Satisfaction (1-5)") +
# ggthemes::theme_fivethirtyeight() +
# theme(legend.position = "top",
# text = element_text(family = "Courier"),
# axis.text.y = element_text())From our data science project, we could find the following two findings:
The first case of COVID-19 found in Singapore was confirmed on 23 January 2020, and since nationwide partial lockdown or circuit breaker kicked in from 7 April 2020 to 1 June 2020, Singapore have been experiencing policies change from time to time to effectively cope with the surges. As new norm has been rooted in local lifestyle, no apparent jump or trend has been observed in public sentiment. In fact, we further analyse the effects from the most recent key announcements, a sudden increase of number of tweets and jump in overall sentiment appears that there is a causation effect, but it did not drive any specific emotion class to form trending. Despite that social curb extension announced expected to cause negative sentiment, PM address has outweighed the effects on it and sentiment emerge with more trusts to the leadership. Thus, information delivering, create awareness and generate leads seems to be effective in producing positive public sentiment.
Prof. Roh’s Note: “This is where you provide the significance of the findings. Unlike the other sections, where your goal is to describe the results that you found (what the data told you). This is where you chime in and proactively discuss the meaning of the results.”
The public sentiment reflected and analysed by tweets data especially effective among younger age group who spend most of their time in social media and carefully tweets what is in their mind. This can serve only a small sample of a whole local population. If possible, we also need to filter out news publisher tweets and focus on individual tweets. In here, We are comparing sentiment change within group based on causal inference driven by nation-wide speech and announcements. On how COVID-19 has changed the sentiment in the society, we have to identify and aware of timeline on key events happened and also view from a longer term from pre-covid and covid era, hence a relative longer time period of data will provide more insights for analysing the differences. We are hold back by the free Twitter developer account which eligible to extract tweets up to 7 days in the past.
Local slang words or Singlish are casually and frequently used among local community and most of them could be a good gauge of emotion, E.g. even a dialect vulgar should reflect anger, thus enriching the emotion dictionary like NRC can better tune to understand the positive and negative in sentiment.
Feature Importance describe which features are relevant and is another aspect we should explore to help us better understanding of solved problem and sometimes lead to model improvements for accuracy by employing feature selection.
[1] Julia Silge & David Robinson Text Mining with R - A TIDY APPROACH O’Reilly (2017)
[2] Andrea Cirillo R Data Mining - Implement data mining techniques through practical use cases and real-world datasets Packt> (2017)
[3] Tony Carilli R Companion to Real Econometrics February 2021 https://bookdown.org/carillitony/bailey/chp11.html
[4] Ashwin Malshe(2019-06-25) Data Analytics Applications https://ashgreat.github.io/analyticsAppBook/collect-tweets.html
Custom Stop-words for Text Pre-processing in Word Cloud data overview “singapore”, “covid”, “covid19”,“positive”,“negative”,“oct”,“nov”,“news”,“amp”,“reuters”,“news”,“daily”, “malaysia”,“november”,“october”,“october”,“press”,“journal”,“amid”,“weekly”,“days”,“weeks”,“china”, “chinese”,“report”,“am”,“pm”,“dont”,“taking”,“found”,“morning”,“bloomberg”,“months”,“month”,“india”, “media”,“week”,“read”,“reports”,“data”,“europe”,“monday”,“tuesday”,“wednesday”,“thursday”,“friday”, “satursday”,“sunday”,“wall”,“street”
The objective is to clean those are less relevant and very little meaning to find sentiment, such as punctuation, special character, prefix with number, hashtag, alias, links and custom terms above.
We removed duplicated text in tweets, sent from the same screen name multiple times. For instance, there are several news publishers have posted the same tweet on different days.
sessionInfo()sessionInfo()## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.3.4 knitr_1.33 ggfortify_0.4.12 DT_0.19
## [5] glue_1.4.2 rtweet_0.7.0.9000 gtrendsR_1.4.8 gvlma_1.0.0.3
## [9] ggstance_0.3.5 scales_1.1.1 ggthemes_4.2.4 sandwich_3.0-1
## [13] car_3.0-11 carData_3.0-4 Hmisc_4.5-0 Formula_1.2-4
## [17] survival_3.2-13 lattice_0.20-44 psych_2.1.9 gridExtra_2.3
## [21] huxtable_5.4.0 jtools_2.1.4 textdata_0.4.1 reshape2_1.4.4
## [25] wordcloud_2.6 RColorBrewer_1.1-2 wordcloud2_0.2.1 tidytext_0.3.1
## [29] lubridate_1.7.10 modelr_0.1.8 broom_0.7.9 forcats_0.5.1
## [33] stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4 readr_2.0.1
## [37] tidyr_1.1.4 tibble_3.1.5 ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-2 ellipsis_0.3.2 rio_0.5.27
## [4] htmlTable_2.2.1 base64enc_0.1-3 fs_1.5.0
## [7] rstudioapi_0.13 farver_2.1.0 SnowballC_0.7.0
## [10] fansi_0.5.0 xml2_1.3.2 splines_4.1.0
## [13] mnormt_2.0.2 jsonlite_1.7.2 cluster_2.1.2
## [16] dbplyr_2.1.1 png_0.1-7 compiler_4.1.0
## [19] httr_1.4.2 backports_1.2.1 assertthat_0.2.1
## [22] Matrix_1.3-4 fastmap_1.1.0 cli_3.1.0
## [25] htmltools_0.5.2 tools_4.1.0 gtable_0.3.0
## [28] rappdirs_0.3.3 Rcpp_1.0.7 cellranger_1.1.0
## [31] jquerylib_0.1.4 vctrs_0.3.8 svglite_2.0.0
## [34] nlme_3.1-152 crosstalk_1.1.1 xfun_0.25
## [37] openxlsx_4.2.4 rvest_1.0.1 lifecycle_1.0.1
## [40] pacman_0.5.1 zoo_1.8-9 hms_1.1.1
## [43] parallel_4.1.0 yaml_2.2.1 curl_4.3.2
## [46] pander_0.6.4 sass_0.4.0 rpart_4.1-15
## [49] latticeExtra_0.6-29 stringi_1.7.4 highr_0.9
## [52] tokenizers_0.2.1 checkmate_2.0.0 zip_2.2.0
## [55] systemfonts_1.0.2 commonmark_1.7 rlang_0.4.12
## [58] pkgconfig_2.0.3 evaluate_0.14 labeling_0.4.2
## [61] htmlwidgets_1.5.4 tidyselect_1.1.1 plyr_1.8.6
## [64] magrittr_2.0.1 R6_2.5.1 generics_0.1.1
## [67] DBI_1.1.1 mgcv_1.8-36 pillar_1.6.4
## [70] haven_2.4.3 foreign_0.8-81 withr_2.4.2
## [73] abind_1.4-5 nnet_7.3-16 janeaustenr_0.1.5
## [76] crayon_1.4.2 utf8_1.2.2 tmvnsim_1.0-2
## [79] tzdb_0.1.2 rmarkdown_2.10 jpeg_0.1-9
## [82] radarchart_0.3.1 grid_4.1.0 readxl_1.3.1
## [85] data.table_1.14.2 webshot_0.5.2 reprex_2.0.1
## [88] digest_0.6.28 munsell_0.5.0 viridisLite_0.4.0
## [91] bslib_0.2.5.1
Prof. Roh’s Note: “Please describe your individual contribution to the team’s project (in detail).”